{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CPSC 330 hw7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instructions\n",
    "rubric={points:5}\n",
    "\n",
    "Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1: time series prediction\n",
    "\n",
    "In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset. As usual, please do not commit it to your repos."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"avocado.csv\", parse_dates=[\"Date\"], index_col=0)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"Date\"].min()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"Date\"].max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like the data ranges from the start of 2015 to March 2018 (~2 years ago), for a total of 3.25 years or so. Let's split the data so that we have a 6 months of test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "split_date = '20170925'\n",
    "df_train = df[df[\"Date\"] <= split_date]\n",
    "df_test  = df[df[\"Date\"] >  split_date]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(df_train) + len(df_test) == len(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(a)\n",
    "rubric={points:3}\n",
    "\n",
    "In the Rain is Australia dataset from lecture, we had different measurements for each Location. What about this dataset: for which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(b)\n",
    "rubric={points:3}\n",
    "\n",
    "In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(c)\n",
    "rubric={points:1}\n",
    "\n",
    "In the Rain is Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are also all distinct, or are there overlapping regions? Justify your answer by referencing the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use the entire dataset despite any location-based weirdness uncovered in the previous part.\n",
    "\n",
    "We will be trying to forecast the avocado price, which is the `AveragePrice` column. The function below is adapted from Lecture 17, with some improvements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_lag_feature(df, orig_feature, lag, groupby, new_feature_name=None, clip=False):\n",
    "    \"\"\"\n",
    "    Creates a new feature that's a lagged version of an existing one.\n",
    "    \n",
    "    NOTE: assumes df is already sorted by the time columns and has unique indices.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    df : pandas.core.frame.DataFrame\n",
    "        The dataset.\n",
    "    orig_feature : str\n",
    "        The column name of the feature we're copying\n",
    "    lag : int\n",
    "        The lag; negative lag means values from the past, positive lag means values from the future\n",
    "    groupby : list\n",
    "        Column(s) to group by in case df contains multiple time series\n",
    "    new_feature_name : str\n",
    "        Override the default name of the newly created column\n",
    "    clip : bool\n",
    "        If True, remove rows with a NaN values for the new feature\n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    pandas.core.frame.DataFrame\n",
    "        A new dataframe with the additional column added.\n",
    "    \"\"\"\n",
    "        \n",
    "    if new_feature_name is None:\n",
    "        if lag < 0:\n",
    "            new_feature_name = \"%s_lag%d\" % (orig_feature, -lag)\n",
    "        else:\n",
    "            new_feature_name = \"%s_ahead%d\" % (orig_feature, lag)\n",
    "    \n",
    "    new_df = df.assign(**{new_feature_name : np.nan})\n",
    "    for name, group in new_df.groupby(groupby):        \n",
    "        if lag < 0: # take values from the past\n",
    "            new_df.loc[group.index[-lag:],new_feature_name] = group.iloc[:lag][orig_feature].values\n",
    "        else:       # take values from the future\n",
    "            new_df.loc[group.index[:-lag], new_feature_name] = group.iloc[lag:][orig_feature].values\n",
    "            \n",
    "    if clip:\n",
    "        new_df = new_df.dropna(subset=[new_feature_name])\n",
    "        \n",
    "    return new_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first sort our dataframe properly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_sort = df.sort_values(by=[\"region\", \"type\", \"Date\"]).reset_index(drop=True)\n",
    "df_sort"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_hastarget = create_lag_feature(df_sort, \"AveragePrice\", +1, [\"region\", \"type\"], \"AveragePriceNextWeek\", clip=True)\n",
    "df_hastarget"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I will now split the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train = df_hastarget[df_hastarget[\"Date\"] <= split_date]\n",
    "df_test  = df_hastarget[df_hastarget[\"Date\"] >  split_date]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(d)\n",
    "rubric={points:1}\n",
    "\n",
    "Next we will want to build some models to forecast the average avocado price a week in advance. Before we start with any ML, let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of \"AveragePriceNextWeek\" exactly equal to \"AveragePrice\", assuming no change. That is kind of like saying, \"If it's raining today then I'm guessing it will be raining tomorrow\". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse that this, it must not be very good. \n",
    "\n",
    "Using this baseline approach, what $R^2$ do you get?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(e)\n",
    "rubric={points:10}\n",
    "\n",
    "Build some models to forecast the average avocado price. Experiment with a few approachs for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.\n",
    "\n",
    "Benchmark: you should be able to achieve $R^2$ of at least 0.79 on the test set. I got to 0.80, but not beyond that. Let me know if you do better!\n",
    "\n",
    "Note: because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(f)\n",
    "rubric={points:3}\n",
    "\n",
    "We talked a little bit about _seasonality_, which is the idea of a periodic component to the time series. For example, in lecture we attempted to capture this by encoding the month. Something we didn't discuss much is _trends_, which are long-term variations in the quantity of interest. Aside from the effects of climate change, the amount of rain in Australia is likely to vary during the year but less likely to have long-term trends over the years. Avocado prices, on the other hand, could easily exhibit trends: for example avocados may just cost more in 2020 than they did in 2015.\n",
    "\n",
    "Briefly discuss in ~1 paragraph: to what extent, if any, was your model above able to account for seasonality? What about trends?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2: very short answer questions\n",
    "\n",
    "Each question is worth 2 points."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(a)\n",
    "rubric={points:4}\n",
    "\n",
    "The following questions pertain to Lecture 17 on time series data:\n",
    "\n",
    "1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.\n",
    "2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2(b)\n",
    "rubric={points:6}\n",
    "\n",
    "The following questions pertain to Lecture 18 on survival analysis. We'll consider the use case of customer churn analysis.\n",
    "\n",
    "1. What is the problem with simply labeling customers are \"churned\" or \"not churned\" and using standard supervised learning techniques, as we did in hw4?\n",
    "2. Consider customer A who just joined last week vs. customer B who has been with the service for a year. Who do you expect will leave the service first: probably customer A, probably customer B, or we don't have enough information to answer?\n",
    "3. If a customer's survival function is almost flat during a certain period, how do we interpret that?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Submission to Canvas\n",
    "\n",
    "**IF YOU ARE WORKING WITH A PARTNER** please form the group before submitting - see instructions [here](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#partners).\n",
    "\n",
    "When you are ready to submit your assignment do the following:\n",
    "\n",
    "1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`.\n",
    "2. Save your notebook.\n",
    "3. Convert your notebook to `.html` format using the `convert_notebook()` function below **or** by `File -> Export Notebook As... -> Export Notebook to HTML`.\n",
    "4. Run the code `submit()` below to go through an interactive submission process to Canvas.\n",
    ">For this step, you will need a Canvas *Access Token* token. If you haven't already got one, log-in to Canvas, click `Account` (top-left of the screen), then `Settings`, then scroll down until you see the `+ New Access Token` button. Click that button, give your token any name you like and set the expiry date to Dec 31, 2020. Then click `Generate token`. Save this token in a safe place on your computer as you'll need it for all assignments. Treat the token with as much care as you would an important password. \n",
    "\n",
    "Note: for those having trouble with the Jupyter widgets and the dropdowns: if you add the argument `no_widgets=True` to your `submit` call, it should let you do a text-based entry of your key and avoid the dropdowns altogether. If this doesn't work, you probably need to upgrade to the latest version of `canvasutils` with `pip install canvasutils -U` from your terminal with your environment activated.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from canvasutils.submit import submit, convert_notebook\n",
    "\n",
    "# Note: the canvasutils package should have been installed as part of your environment setup - \n",
    "# see https://github.com/UBC-CS/cpsc330/blob/master/docs/setup.md"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert_notebook(\"hw7.ipynb\", \"html\")  # uncomment and run when you want to try convert your notebook to HTML (or you can convert manually from the File menu)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# submit(course_code=53561, token=False)  # uncomment and run when ready to submit "
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "name": "_merged",
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "navigate_num": "#000000",
    "navigate_text": "#333333",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700",
    "sidebar_border": "#EEEEEE",
    "wrapper_background": "#FFFFFF"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "438px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
